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Abstract 

Determinantal point processes (DPPs) offer an el¬ 
egant tool for encoding probabilities over subsets 
of a ground set. Discrete DPPs are parametrized 
by a positive semidefinite matrix (called the DPP 
kernel), and estimating this kernel is key to learn¬ 
ing DPPs from observed data. We consider the 
task of learning the DPP kernel, and develop for it 
a surprisingly simple yet effective new algorithm. 

Our algorithm offers the following benefits over 
previous approaches: (a) it is much simpler; (b) it 
yields equally good and sometimes even better lo¬ 
cal maxima; and (c) it runs an order of magnitude 
faster on large problems. We present experimental 
results on both real and simulated data to illustrate 
the numerical performance of our technique. 

1. Introduction 

Determinantal point processes (DPPs) arose in statisti¬ 
cal mechanics, where they were originally used to model 
fermions (Macchi, 1975). Recently, they have witnessed 
substantial interest in a variety of machine learning applica¬ 
tions (Kulesza, 2013; Kulesza and Taskar, 2012). 

One of the key features of DPPs is their ability to model the 
notion of diversity while respecting quality, a concern that 
underlies the broader task of subset selection where balanc¬ 
ing quality with diversity is a well-known issue—see e.g., 
document summarization (Lin and Bilmes, 2012), object 
retrieval (Affandi et al., 2014), recommender systems (Zhou 
et al., 2010), and sensor placement (Krause et al., 2008). 

DPPs are also interesting in their own right: they have 
various combinatorial, probabilistic, and analytic properties, 
while involving a fascinating set of open problems (Lyons, 
2003; Hough et al., 2006; Kulesza, 2013). 

Within machine learning DPPs have found good use—see 
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for instance (Gillenwater et al., 2014); (Kulesza and Taskar, 
2011b); (Kulesza and Taskar, 2011a); (Affandi et al., 2014); 
(Affandi et al., 2103); (Affandi et al., 2013); (Gillenwater 
et al., 2012). For additional references and material we refer 
the reader to the survey (Kulesza and Taskar, 2012). 

Our paper is motivated by the recent work of Gillenwater 
et al. (2014), who made notable progress on the task of 
learning a DPP kernel from data. This task is conjectured 
to be NP-Hard (Kulesza, 2013, Conjecture 4.1). Gillenwa¬ 
ter et al. (2014) presented a carefully designed EM-style 
procedure, which, unlike several previous approaches (e.g., 
(Kulesza and Taskar, 201 lb;a; Affandi et al., 2014)) learns 
a full DPP kernel nonparameterically. 

One main observation of Gillenwater et al. (2014) is that 
applying projected gradient ascent to the DPP log-likelihood 
usually results in degenerate estimates (because it involves 
projection onto the set {X : 0 ^ A ^ /}). Hence one may 
wonder if instead we could apply more sophisticated man¬ 
ifold optimization techniques (Absil et al., 2009; Boumal 
et al., 2014). While this idea is attractive, and indeed ap¬ 
plicable, e.g., via the excellent Manopt toolbox (Boumal 
et al., 2014), empirically it turns out to be computationally 
too demanding; the EM strategy of Gillenwater et al. (2014) 
is more practical. 

We depart from both EM and manifold optimization to de¬ 
velop a new learning algorithm that (a) is simple, yet power¬ 
ful; and (b) yields essentially the same log-likelihood values 
as the EM approach while running significantly faster. In 
particular, our algorithm runs an order of magnitude faster 
on larger problems. 

The key innovation of our approach is a derivation via a 
fixed-point view, which by construction ensures positive 
definiteness at every iteration. Its convergence analysis in¬ 
volves an implicit bound-optimization iteration to ensure 
monotonic ascent. ' A pleasant byproduct of the fixed-point 
approach is that it avoids any eigenvalue/vector computa¬ 
tions, enabling a further savings in running time. 

*The convergence analysis in this version of the paper improves 
upon our original submission, in that our proof is now constructive 
and requires weaker assumptions. 






Fixed-point algorithms for determinantal point processes 


1.1. Background and problem setup 

Without loss of generality we assume that the ground set 
of N items is {1,2,..., N}, which we denote by y. A 
(discrete) DPP on is a probability measure V on 2^ (the 
set of all subsets of y) such that for any Y C y, the prob¬ 
ability V(Y) verifies P(Y) oc. det(Ly); here Ly denotes 
the principal submatrix of the DPP kernel L induced by 
indices in Y. Intuitively, the diagonal entry La of the kernel 
matrix L captures some notion of the importance of item 
i, whereas an off-diagonal entry = Lji measures simi¬ 
larity between items i and j. This intuitive notion provides 
further motivation for seeking DPPs with non-diagonal ker¬ 
nels when there is implicit interaction between the observed 
items. 

The normalization constant for the measure V follows upon 
observing that det(Ly) = det(L -f I). Thus, 


V{Y) 


det(Ly) 
det(L -I- 1) ’ 


Y cy. 


( 1 . 1 ) 


DPPs can also be given an alternative representation through 
a marginal kernel K that captures for a random Y ^ V and 
every A C y, the marginal probability 

V{ACY) =det{KA). (1.2) 

It is easy to verify that K = L{L + I)~^, which also implies 
that K and L have the same eigenvectors and differ only in 
their eigenvalues. It can also be shown (Kulesza, 2013) that 
V(Y) = I dei{K — /yc)|, where /yc is a partial N x N 
identity matrix with diagonal entries in Y zeroed out. 

Both parameterizations (1.1) and (1.2) of the DPP probabil¬ 
ity are useful: Gillenwater et al. (2014) used a formulation 
in terms of AT; we prefer (1.1) as it aligns better with our 
algorithmic approach. 

1.2. Learning the DPP Kernel 


For instance, using projected gradient on (1.4) may seem 
tempting, but projection ends up yielding degenerate (diago¬ 
nal and rank-dehcient) solutions which is undesirable when 
trying to capture interaction between observations—indeed, 
this criticism motivated Gillenwater et al. (2014) to derive 
the EM algorithm. 

We approach problem (1.3) from a different viewpoint 
(which also avoids projection) and as a result obtain a new 
optimization algorithm for estimating L. This algorithm, its 
analysis, and empirical performance are the subject of the 
remainder of the paper. 

2. Optimization algorithm 

The method that we derive has two key components: (i) 
a fixed-point view that helps obtain an iteration that satis¬ 
fies the crucial positive definiteness constraint L A Oby 
construction; and (ii) an implicit bound optimization based 
analysis that ensures monotonic ascent. The resulting algo¬ 
rithm is vastly simpler than the previous EM-style approach 
of Gillenwater et al. (2014). 

If |E| = k, then for a suitable N x k indicator matrix 
U we can write Ly = U* LU, which is also known as 
a compression (U* denotes the Hermitian transpose). We 
write U*LUi interchangeably with Ly., implicitly assuming 
suitable indicator matrices Ui such that U*Ui = f|y|- To 
reduce clutter, we will drop the subscript on the identity 
matrix, its dimension being clear from context. 

Denote by (j){L) the objective function in (1.3). Assume for 
simplicity that the constraint set is open, i.e., L y 0. Then 
any critical point of the log-likelihood must satisfy 

V(p(L) = 0, or equivalently 

(O ]\ 

y” U^(U*LU,y^U*-n(I + L}-^ =0. 

Any (strictly) positive dehnite solution to the nonlinear ma¬ 
trix equation (2.1) is a candidate locally optimal solution. 


The learning task aims to ht a DPP kernel (either L or equiv¬ 
alently the marginal kernel K) consistent with a collection 
of observed subsets. Suppose we obtain as training data n 
subsets (Yi,..., y„) of the ground set y. The task is to max¬ 
imize the likelihood of these observations. Two equivalent 
formulations of this maximization task may be considered: 

E TL 

logdet(Ly) — nlogdet(/-I-L), (1.3) 

_ i=l 


We solve this matrix equation by developing a fixed-point 
iteration. In particular, dehne 

^'■=1 Eili [/*-(/ + L )-\ 

with which we may equivalently write (2.1) as 

A-|-L-i = L-\ (2.2) 

Equation (2.2) suggests the following iteration 

+ fc = 0,l,.... (2.3) 


We will use formulation (1.3) in this paper. Gillenwater 
et al. (2014) used (1.4) and exploited its structure to derive 
a somewhat intricate EM-style method for optimizing it. 
Both (1.3) and (1.4) are nonconvex and difficult optimize. 


A priori there is no reason for iteration (2.3) to be valid 
(i.e., converge to a stationary point). But we write it in this 
form to highlight its crucial feature: starting from an initial 
Lq a 0, it generates positive dehnite iterates (Prop. 2.1). 
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Proposition 2.1. Let Lq >- 0. Then, the sequence {L}T\}.^i 
generated by (2.3) remains positive definite. 

Proof. The proof is by induction. It suffices to show that 
L >- 0 + A ^ 0. 

Since I + L >- L, from the order inversion property of 
the matrix inverse map it follows that 
Now adding the matrix ^ J2i=i {U*LUi) ^ Uf y 0 we 
obtain the desired inequality by dehnition of A. □ 

A quick experiment reveals that iteration (2.3) does not 
converge to a local maximizer of (^(L). To fix this defect, 
we rewrite the key equation (2.2) in a different manner: 

L = L + LAL. (2.4) 

This equation is obtained by multiplying (2.2) on the left 
and right by L. Therefore, we now consider the iteration 

Lk+i Lk + LkAkLk, fc = 0,1,.... (2.5) 

Prop. 2.1 in combination with the fact that congruence pre¬ 
serves positive definiteness (i.e., if A ^ 0, then Z*XZ A 0 
for any complex matrix Z), implies that if Lq >- 0, then 
the sequence {Lk}k>i obtained from iteration (2.5) is also 
positive definite. What is more remarkable is that contrary 
to iteration (2.3), the sequence generated by (2.5) monoton- 
ically increases the log-likelihood. 

While monotonicity is not apparent from our intuitive deriva¬ 
tion above, it becomes apparent once we recognize an im¬ 
plicit change of variables that seems to underlie our method. 

2.1. Convergence Analysis 

Theorem 2.2. Let L^ be generated via (2.5). Then, the 
sequence {(()(Tfc)}fc>o monotonically increasing. 

Before proving Theorem 2.2 we need the following lemma. 

Lemma 2.3. Let U S (k < N) such that U*U = I. 
The map g{S) := \ogdet{U*S~^U) is convex on the set of 
positive definite matrices. 

Proof. Since g is continuous it suffices to establish midpoint 
convexity. Consider therefore, X,Y 0 and let 

be their geometric mean. The operator inequality X^Y A 
is well-known (Bhatia, 2007, Thm. 4.1.3). Hence, 

A(A#y)-i = A-i#y-i 
U* u y U*{X-^#Y-^)U 

A {U*X-'^U)#{U*Y-^U), 


where equality follows from (Bhatia, 2007, Thm. 4.1.3), and 
the final inequality follows from (Bhatia, 2007, Thm. 4.1.5)^ 
Since log det is monotonic on positive definite matrices and 
since det{A=ffB) = Vdet As /det B, it then follows that 

logdet(C/* (^)”V) < ilogdet(C/*A-i[/) 

-f |logdet((7*y-^t/), 

which proves the lemma. □ 

Now we are ready to prove Theorem 2.2. 

Proof (Thm. 2.2). The key insight is to consider S = L~^ 
instead of L; this change is only for the analysis—the actual 
iteration that we implement is still (2.5).^ 

Writing il)(S) := 4>{L), we see that ^’(<5') equals 

i logdetiU*S-^U,) - logdet(.S-i + I) 

= logdet(S') -f i^^logdet(t7*S'"^[/i) 

— log det(/ -I- S). 

L&t h{S) = ^jlogdet({7*S'“^t/i)—logdet(/-|-S'), and 

f{S) = logdet(5'). Clearly, / is concave in S, while h is 
convex is S'; the latter from Lemma 2.3 and the fact that 
— log det(/ -I- S) is convex. This observation allows us to 
invoke iterative bound-optimization (an idea that underlies 
EM, CCCP, and other related algorithms). 

In particular, we construct an auxiliary function ^ so that 
tlt{S)>i(S,R), VS,i?>-0, 

V^(S)=e(S,S), vs^o. 

As in (Yuille and Rangarajan, 2003), we select ^ by exploit¬ 
ing the convexity of h: as h{S) > h(R) + {Xh{R), S — R), 
we simply set 

C(S, R) := f{S) + h{R) + (Vh(i?) I S - i?). 

Given an iterate Sk, we then obtain Sfc+i by solving 

Sk+i :=argmaxg^o ^(S,Sk), (2.6) 

which clearly ensures monotonicity: tp(Sk+i) > '4’{Sk). 

Since (2.6) has an open set as a constraint and <^(S, •) is 
strictly concave, to solve (2.6) it suffices to solve the neces¬ 
sary condition Xs^{S, Sk) = 0. This amounts to 

= (i+Sk)-^ + kY^^syufu:syur^u:sy. 

Rewriting in terms of L we immediately see that with 
Lk+i = Lk + LkAkLk, (l>{Lk+i) > 4>{Lk) (the inequality 
is strict unless Lk+i = Lk). □ 

^For an explicit proof see (Sra and Hosseini, 2015, Thm. 8). 
^Our previous proof was based on viewing iteration (2.5) as a 
scaled-gradient-like iteration. However, we find the present version 
more transparent for proving monotonicity. 
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Theorem 2.2 shows that iteration (2.5) is well-defined (posi¬ 
tive definiteness was established by Prop. 2.1). The fixed- 
point formulation (2.5) actually suggests a broader iteration, 
with an additional step-size a: 

^k+l — O.Ij}zl^kLk- ( 2 . 7 ) 

Above we showed that for a = 1 ascent is guaranteed. Em¬ 
pirically, a > 1 often works well; Prop. A.l presents an 
easily computable upper bound on feasible a. We conjec¬ 
ture that for all feasible values a > 1, iteration (2.5) is 
guaranteed to increase the log-likelihood. 

Moreover, all previous calculations can be redone in the 
context where L = F*WF for a fixed feature matrix F in 
order to learn the weight matrix W (under the assumption 
that S*S is invertible), making our approach also useful in 
the context of feature-based DPP learning. 

Pseudocode of our resulting learning method is presented in 
Algorithms 1 and 2. For simplicity, we recommend using a 
fixed value of a (which can be set at initialization). 

Algorithm 1 Picard Iteration 

Input: Matrix L, training set T, step-size a > 0. 
for i = 1 to maxiter do 

L i — FixedPointMap(I/, T, a) 
if stop(L, T, i) then 
break 
end if 
end for 
return L 


Algorithm 2 FixedPointMap 

Input: Matrix L, training set T, step-size a > 0 
Z <— 0 
for y in T do 
Zy = Zy -t- Ly^ 
end for 

return L + aL{Z/\T\ - {L + I)~^)L 


2.2. Iteration cost and convergence speed 

The cost of each iteration of our algorithm is dominated by 
the computation of A, which costs a total of l^i P + 

N^) = 0{nK^ + N^) arithmetic operations, where k = 
maxi \ Yi\', the 0{\Yi\^) cost comes from the time required 
to compute the inverse Ly ^, while the cost stems from 
computing (/ + L)~^. Moreover, additional costs arise 
when computing LAL. 

In comparison, each iteration of the method of Gillenwater 
et al. (2014) costs 0{nN+ N^), which is comparable 
to, though slightly greater than 0{nK^ -f N^) as N > k. In 
applications where the sizes of the sampled subsets satisfy 
K ^ TV, the difference can be more substantial. More¬ 
over, we do not need any eigenvalue/vector computations to 
implement our algorithm. 


Finally, our iteration also runs slightly faster than the K- 
Ascent iteration, which costs 0{nN^). Additionally, simi¬ 
larly to EM, our algorithm avoids the projection step neces¬ 
sary in the K-Ascent algorithm (which ensures K G {X : 
0 A A ^ /}). As shown in (Gillenwater et al., 2014), 
avoiding this step helps learn non-diagonal matrices. 

We note in passing that similar to EM, assuming a non¬ 
singular local maximum, we can also obtain a local linear 
rate of convergence. This follows by relating iteration (2.5) 
to scaled-gradient methods (Bertsekas, 1999, §1.3) (except 
that we have an implicit PSD constraint). 

3. Experimental results 

We compare performance of our algorithm, referred to 
as Picard iteration^, against the EM algorithm presented 
in Gillenwater et al. (2014). We experiment on both syn¬ 
thetic^ and real-world data. 

For real-world data, we use the baby registry test on which 
results are reported in (Gillenwater et ah, 2014). This dataset 
consists in 111, 006 sub-registries describing items across 
13 different categories; this dataset was obtained by col¬ 
lecting baby registries from amazon . com, all containing 
between 5 and 100 products, and then splitting each registry 
into subregistries according to which of the 13 categories 
(such as “feeding”, “diapers”, “toys”, etc.) each product in 
the registry belongs to. (Gillenwater et ah, 2014) provides a 
more in-depth description of this dataset. 

These sub-registries are used to learn a DPP capable of 
providing recommendations for these products: indeed, a 
DPP is well-suited for this task as it provides sets of products 
in a category that are popular yet diverse enough to all be of 
interest to a potential customer. 

3.1. Implementation details 

We measure convergence by testing the relative change 
used a tighter convergence crite¬ 
rion for our algorithm (£pic = 0.5-eem) to account for the fact 
that the distance between two subsequent log-likelihoods 
tends to be smaller for the Picard iteration than for EM. 

The parameter a for Picard was set at the beginning of 
each experiment and never modified as it remained valid 
throughout each test^. In EM, the step size was initially 

"'Our nomenclature stems from the usual name for such itera¬ 
tions in fixed-point theory (Granas and Dugundji, 2003). 

^The figures and tables for the synthetic results have been 
modified to include some minor corrections: in particular. Tables 1 
and 2 now show the runtime to 99%. The runtimes were initially 
to final convergence, but erroneously reported to be to 95%. 

^Although it was not necessary in our experiments, if the pa¬ 
rameter a becomes invalid, it can be halved until it reaches 1. 
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set to 1 and halved when necessary, as per the algorithm 
described in (Gillenwater et ah, 2014); we used the code 
of Gillenwater et al. (2014) for our EM implementation^. 

3.2. Synthetic tests 

In each experiment, we sample n training sets from a base 
DPP of size N, then learn the DPP using EM and the Picard 
iteration. We initialize the learning process with a random 
positive dehnite matrix Lq (or Kq for EM) drawn from the 
same distribution as the true DPP kernel. 

Specifically, we used two matrix distributions to draw the 
true kernel and the initial matrix values from: 

• BASIC: We draw the coefficients of a matrix M from 
the uniform distribution over [O, v^], then return L = 
MM^ conditioned on its positive dehniteness. 

• WISMART: We draw L from a Wishart distribution 
with N degrees of freedom and an identity covariance 
matrix, and rescale it with a factor 

Eigures 1, 2 and 3 show the log-likelihood as a function of 
time for different parameter values when both the true DPP 
kernel and the initial matrix Lq were drawn from the BAS IC 
distribution. Tables 1 and 2 show the final log-likelihood 
and the time necessary for each method to reach 99% of the 
optimal log likelihood for both distributions and parameters 
n = 5000, a = 5. 

As shown in Figure 1, the difference in time necessary for 
both methods to reach a good approximation of the final 
likelihood (as dehned by best hnal likelihood) grows drasti¬ 
cally as the size N of the set of all elements {1,2,..., N} 
increases. Figure 2 illustrates the same phenomenon when 
N is kept constant and n increases. 

Finally, the influence of the parameter a on convergence 
speed is illustrated in Figure 3* *. Increasing a noticeably in¬ 
creases Picard’s convergence speed, as long as the matrices 
remain positive definite during the Picard iteration. 

The greatest strength of the Picard iteration lies in its initial 
rapid convergence: the log-likelihood increases significantly 
faster for the Picard iteration than for EM. Although for 
small datasets EM sometimes performs better, our algorithm 
provides substantially better results in shorter timeframes 
when dealing with larger datasets. 

Overall, our algorithm converges to 99% of the optimal log- 
likelihood (defined as the maximum of the log-likelihoods 
returned by each algorithm) significantly faster than the EM 

’These experiments were run with MATLAB, on a Linux Mint 
system, using 16GB of RAM and an i7-4710HQ CPU @ 2.50GHz. 

*In the cases where a > 1, a safeguard was added to check that 
the matrices returned by our algorithm were positive definite. 


Table 1. Final log-likelihoods and time necessary for an iteration 
to reach 99% of the optimal log likelihood for both algorithms 
when using BAS IC distribution for trae and initialization matrices 
(training set size of 5,000, a = 5). 



Log-Likelihood 

Runtime to 99% 

Picard 

EM 

Picard 

EM 

A = 50 

-15.5 

-15.5 

17.3s 

30.7s 

N = 100 

-24.4 

-24.2 

143s 

75.5s 

N = 150 

-32.5 

-32.5 

40.7s 

84.0s 

N = 200 

-40.8 

-41.2 

51.Is 

1,730s 

N = 250 

-45.7 

-46.0 

99.1s 

2,850s 


Table 2. Final log-likelihoods and time necessary for an iteration to 
reach 99% of the optimal log likelihood for both algorithms when 
using WISHART distribution for true and initialization matrices 
(training set size of 5,000, a = 5). 



Log-Likelihood 

Runtime to 99% 

Picard 

EM 

Picard 

EM 

A = 50 

-33.0 

-33.1 

0.2s 

2.0s 

A = 100 

-66.2 

-66.2 

0.5s 

3.6s 

A = 150 

-99.2 

-99.3 

0.8s 

5.2s 

A = 200 

-132.1 

-132.4 

1.2s 

8.9s 

A = 250 

-165.1 

-165.7 

2.5s 

11s 


algorithm for both distributions, particularly when dealing 
with large values of N. 

Thus, the Picard iteration is preferable when dealing with 
large ground sets; it is also very well-suited to cases where 
larger amounts of training data are available. 

3.3. Baby registries experiment 

We tested our implementation on all 13 product categories in 
the baby registry dataset, using two different initializations: 

• the aforementioned Wishart distribution 

• the data-dependent moment matching initialization 
(MM) described in (Gillenwater et al., 2014) 

In each case, 70% of the baby registries in the product cate¬ 
gory were used for training; 30% served as test. The results 
presented in Figures 4 and 5 are averaged over 5 learning 
trials, each with different initial matrices; the parameter a 
was set equal to 1.3 for all iterations. 

Similarly to its behavior on synthetic datasets, the Picard it¬ 
eration provides overall significantly shorter mntimes when 
dealing with large matrices and training sets. As shown in 
Table 3, the final log-likelihoods are very close (on the order 
10“^ to 10“^) to those attained by the EM algorithm. 

Using a moments-matching initialization leaves Picard’s 
mntimes overall unchanged (a notable exception being the 
‘gear’ category). However, EM’s runtime decreases dras¬ 
tically with this initialization, although it remains signifi¬ 
cantly longer than Picard’s in most categories. 
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(a) TV = 50 (h) N ^ 100 (c) N = 150 

Figure 1. Normalized log-likelihood as a function of time for various set sizes N, with n = 5000 and a = 5 using the BASIC random 
distribution. 



Figure 2. Normalized log likelihood as a function of time for various numbers of training sets, with = 50 and a = 5 using the BASIC 
random distribution. 





(a) a = 1 


(b) a = 5 


(c) a = 10 


Figure 3. Normalized log likelihood as a function of time for different values of a, with N = 50 and n = 5000 using the BASIC random 
distribution. 


The final log-likelihoods are also closer when using 
moments-matching initialization (see Table 3). 

4. Conclusions and future work 

We approached the problem of maximum-likelihood estima¬ 
tion of a DPP kernel from a different angle: we analyzed the 
stationarity properties of the cost function and used them 


to obtain a novel fixed-point Picard iteration. Experiments 
on both simulated and real data showed that for a range of 
ground set sizes and number of samples, our Picard itera¬ 
tion runs remarkably faster that the previous best approach, 
while being extremely simple to implement. In particular, 
for large ground set sizes our experiments show that our 
algorithm cuts down runtime to a fraction of the previously 
optimal EM runtimes. 



























Fixed-point algorithms for determinantal point processes 


feeding 
gear 
diaper 
bedding 
apparei 
bath 
toys 
heaith 
media 
stroilers 
safety 
carseats 
furniture 

0 2 4 6 8 10 12 14 

Normalized NLL 



feeding 

gear 

diaper 

bedding 

apparei 

bath 

toys 

heaith 

media 

stroiiers 

safety 

carseats 

furniture 



0 10 20 30 40 50 60 

time (seconds) 



Picard 


EM 


70 80 90 
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Figure 4. Evaluation of EM and the Picard iteration on the baby registries dataset using Wishart initialization. 
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(a) Einal negative log-likelihood 


(b) Runtime 


Figure 5. Evaluation of EM and the Picard iteration on the baby registries dataset using moments-matching initialization. 


We presented a theoretical analysis of the convergence prop¬ 
erties of the Picard iteration, and found sufficient conditions 
for its convergence. However, our experiments reveal that 
the Picard iteration converges for a wider range of step- 
sizes (parameter a in the iteration and plots) than currently 
accessible to our theoretical analysis. It is a part of our 
future work to develop more complete convergence theory, 
especially because of its strong empirical performance. 

In light of our results, another line of future work is to apply 
fixed-point analysis to other DPP learning tasks. 
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A. Bound on a 

Proposition A.l. Let L, Ui, and A be as defined above; set 
Z — ^ Ui {U*LUi) ^ U*. Define the constant 

T ■“ niax'[Amin(7.^), 1/A max (/ + L)}. (A.l) 

Then, 0 < 7 < 1 and for a < (1 — 7 )”^ the update 

L' L + aLAL 

ensures that L' is also positive definite. 


Proof. Let Z = i U. {UlLUi)-^ U*. 
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Table 3. Comparison of final log-likelihoods for both algorithms; 
relative closeness between Picard and EM: <5 = |0em — <^pic|/<?!>cm- 


Category 

5 (WiSHART) 

5 (MM) 

FURNITURE 

4.4E-02 

1.2E-03 

CARSEATS 

3.7E-02 

7.6E-04 

SAFETY 

3.3E-02 

8.0E-04 

STROLLERS 

3.9E-02 

3.0E-03 

MEDIA 

2.3E-02 

2.4E-03 

HEALTH 

2.6E-02 

7.4E-03 

TOYS 

2.0E-02 

5.9E-03 

BATH 

2.6E-02 

2.9E-03 

APPAREL 

9.2E-03 

4.3E-03 

BEDDING 

1.3E-02 

7.6E-03 

DIAPER 

7.2E-03 

5.3E-03 

GEAR 

2.3E-03 

9.0E-03 

FEEDING 

4.9E-04 

2.1E-03 


To ensure L + aLAL >~ Owe. equivalently show 

71 

E u:-{l +/)-!) ^ 0 

I+ >-aL{L +I)~'^ 

(l-a)/ + o(/-fL)-i + oL^/^ZLi/VO 

(asL(L + /)-i =/ - (/ + L)-i) 

(1 - a) + aA„un((/ + L)-^ + > 0. 

This inequality can be numerically optimized to find the 
largest feasible value of a. The simpler bound in question 
can be obtained by noting that 

Xn^iniH + L)-^ + 

E 1/A max (/ + L)} = 7. 

Thus, we have the easily computable bound for feasible a: 


Clearly, by construction 7 > 0. To see why 7 < 1, observe 
that {I + L) -< /, so that Aniin((-f + < 1- Further, 

block-matrix calculations show that Z ^ L~^, whereby 

A„,i„(Li/2ZLi/2) < A^i„(/) = 1. □ 
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